CellBRF: a feature selection method for single-cell clustering using cell balance and random forest

Abstract Motivation Single-cell RNA sequencing (scRNA-seq) offers a powerful tool to dissect the complexity of biological tissues through cell sub-population identification in combination with clustering approaches. Feature selection is a critical step for improving the accuracy and interpretability of single-cell clustering. Existing feature selection methods underutilize the discriminatory potential of genes across distinct cell types. We hypothesize that incorporating such information could further boost the performance of single cell clustering. Results We develop CellBRF, a feature selection method that considers genes’ relevance to cell types for single-cell clustering. The key idea is to identify genes that are most important for discriminating cell types through random forests guided by predicted cell labels. Moreover, it proposes a class balancing strategy to mitigate the impact of unbalanced cell type distributions on feature importance evaluation. We benchmark CellBRF on 33 scRNA-seq datasets representing diverse biological scenarios and demonstrate that it substantially outperforms state-of-the-art feature selection methods in terms of clustering accuracy and cell neighborhood consistency. Furthermore, we demonstrate the outstanding performance of our selected features through three case studies on cell differentiation stage identification, non-malignant cell subtype identification, and rare cell identification. CellBRF provides a new and effective tool to boost single-cell clustering accuracy. Availability and implementation All source codes of CellBRF are freely available at https://github.com/xuyp-csu/CellBRF.


List of supplementary materials
The visual analysis of time-course scRNA-seq data. Fig. S2 The visual analysis of Puram dataset.     Table S6 Comparison of CellBRF with six state-of-the-art gene selection methods for single cell clustering without PCA in terms of adjusted Rand index (ARI). Table S7 In the spaces constructed based on different feature selection methods, the distribution of k -nearest neighbor consistency with different number of neighbors on all twenty-two small datasets. Table S8 Comparison of silhouettes in feature spaces constructed by different feature selection methods on twenty-two small datasets. Table S9 Clustering performance comparison of CellBRF in various scenarios on all datasets in terms of NMI and ARI. Table S10 Average clustering performance comparison of ranking-based feature selection methods on all datasets with different feature set sizes (20∼4000) in terms of ARI. Table S11 Clustering performance comparison of feature sets with different feature set sizes (20∼4000 and three-sigma rule of thumb) on all datasets in terms of ARI. Table S12 Comparison of t-SNE embedding results based on silhouette coefficient and classifier accuracy on the time-course dataset. Table S13 Comparison of t-SNE embedding results based on silhouette coefficient and classifier accuracy on the human tumor dataset. Table S14 The runtime (in seconds) of various feature selection methods on five datasets of different sizes. Table S15 A summary list of parameters used in CellBRF. Section A Stability of CellBRF with respect to parameters.    . Stability test of CellBRF by using labels with different error rates. a) The correspondence between various error rates and the evaluation criteria (NMI and ARI) on test datasets. b) Overlap comparison of genes selected based on labels with different error rates. c) The clustering performance of the final selected genes by using labels with various error rates in terms of NMI and ARI.    Section A: Stability of CellBRF with respect to parameters To evaluate the stability of CellBRF in terms of parameters, we divide the main parameters into four groups in Table S15 according to the steps in CellBRF. Then, we use different values to test each parameter while keeping the values of all other parameters unchanged on five gold-standard scRNA-seq datasets ( Supplementary Fig. S4).

L+1
Group 1: spectral clustering-based cell cluster label prediction There are three parameters in group 1, which are the number of principal components used in PCA (P ), the number of neighbors (k) in k -NN, and the number of clusters used for clustering (n).
1. the principal components (PCs) parameter P : c) PCA is used to reduce the dimensionality of scRNA-seq data, containing all expressed genes, before constructing a k-NN graph. This overcomes the curse of dimensionality and enhances the efficiency of graph construction. The number of principal components used for analysis depends on the specific dataset and the analysis goals, but it's common to use around 50 principal components in popular analysis tools like Scanpy and Seurat (Wolf F et al., 2018;Satija et al., 2015). So, we set P as 50 in CellBRF, consistent with prior studies.
To investigate the impact of P , we run our model with the parameters 40, 45, 50, 55, and 60. Fig. S4a shows the clustering performance of the final selected genes with different parameter values in terms of NMI and ARI. 2. the neighbor parameter k: k is the number of neighbors in the k -NN algorithm for constructing the cell graph. It determines the number of edges adjacent to each node. In general, a larger k value leads to a more robust performance against noise but makes boundaries between clusters less distinct. For better clustering performance and consistency with previous studies (Wang et al., 2021;Yu et al., 2022), we set k to 15 in CellBRF to ensure a relatively complete network structure. To investigate the impact of k, we run our model with the parameters 5, 10, 15, 20, and 25. Fig. S4b shows the clustering performance of the final selected genes with different parameter values in terms of NMI and ARI. For better clustering performance and consistency with previous studies (Wang et al., 2021;Yu et al., 2022), we set k to 15 in CellBRF to ensure a relatively complete single-cell network structure. 3. the number of clusters n: n is the number of clusters used in spectral clustering, which determines the granularity or level of details at which the cells are segmented. The value of n in CellBRF is determined by the authors who generated the dataset. If this parameter is not provided, Cell-BRF can automatically determine an optimal number of clusters using the gap statistic approach. We consider five n values centered on n ′ , that is, n ∈ {n ′ − 2, n ′ − 1, n ′ , n ′ + 1, n ′ + 2}. Fig. S4c shows the clustering performance of the final selected genes with different parameter values in terms of NMI and ARI.
We use this spectral clustering flow to quickly obtain cell label assignment results with high accuracy. Therefore, although it may be better to use some data-driven strategies to guide the optimal selection of each data, a small amount of improvement consumes more computing time is unnecessary. Since these parameters primarily affect the accuracy of estimated cell-type labels, we randomly change the cell types of 0%, 5%, 10%, 15%, 20%, 50%, and 100% of cells and compare the clustering performance of the final selected genes in terms of NMI and ARI ( Supplementary Fig. S5). Our experiments show that our method can select 50% of the reference genes and thus maintain stable clustering performance, provided that the cluster labels have some discriminative ability (NMI>0.2, ARI>0.2).

Group 2: data balancing
There are three parameters in group 2, which are the threshold to determine if a cluster is rare, central or major (h), the fraction retained in major clusters (U ), and the sampling rate in SMOTE (R). Where the parameters h and R are automatically determined based on data. The border cells of the cluster can be obtained through the cluster center, and removing these cells will not significantly affect the label accuracy, while retaining the specific information of the cluster, which is helpful for the feature selection step. To investigate the impact of U , we run CellBRF with varying values of U (70%, 75%, 80%, 85%, and 90%). Fig. S4d shows the NMI and ARI values obtained with different values of U . The best value is found at U = 80%. Therefore, we set U as 80% in CellBRF.
Group 3: random forest-based gene importance assessment The random forest model contains multiple parameters such as n estimators, bootstrap, and so on. n estimators is the number of trees in the forest. Using more trees in a random forest model can indeed make the model more stable and perform better in terms of accuracy and reducing overfitting. However, it's important to note that increasing the number of trees can also increase the computational time and memory required to build and use the model, which can slow down the model. After carefully considering the tradeoff between model performance and computational cost, we select 1000 trees in the random forest model. The parameter bootstrap is set to False, which means that we use the entire dataset to build each decision tree to ensure that the labels used for each tree are consistent in accuracy. For other parameters we use default values.

Group 4: redundant gene removal based on linear similarity
This group contains a dynamic threshold h ′ that gradually increases as more genes are removed. In this way, genes with high ranking and high correlation are filtered out. In order to investigate the impact of h ′ , we run our model with the different initial correlation threshold v 0.7, 0.75, 0.8, 0.85, and 0.9. Fig. S4e shows the clustering performance of the final selected genes with different parameter values in terms of NMI and ARI. In contrast, it has better performance at 0.8. Therefore, we set the initial correlation threshold v as 0.8 in CellBRF.